{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c23393bf",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/low_level/vector_store.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbbcac09-a0ab-4b7e-9d9b-5b96ac57611d",
   "metadata": {},
   "source": [
    "# Building a (Very Simple) Vector Store from Scratch\n",
    "\n",
    "In this tutorial, we show you how to build a simple in-memory vector store that can store documents along with metadata. It also exposes a query interface that supports two kinds of queries:\n",
    "- semantic search (with embedding similarity)\n",
    "- metadata filtering\n",
    "\n",
    "**NOTE**: This is not meant to replace a production vector store (e.g. Pinecone, Weaviate, Chroma, Qdrant, Milvus, or others within our wide range of vector store integrations). The goal is to teach some key retrieval concepts, like top-k embedding search + metadata filtering.\n",
    "\n",
    "We won't be covering advanced query/retrieval concepts such as approximate nearest neighbors, sparse/hybrid search, or any of the system concepts that would be required for building an actual database."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f205323d-2003-4c5e-afa4-64fdda5b8c18",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "We load in some documents, and parse them into Node objects - chunks that are ready to be inserted into a vector store."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93871efd-e460-491b-87ef-132109c00244",
   "metadata": {},
   "source": [
    "#### Load in Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c495e86a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-readers-file\n",
    "%pip install llama-index-embeddings-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "694e4a15-3736-47f6-b323-59b6c6492fad",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p data\n",
    "!wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1f99ca0-aea9-4441-bace-c66d797a88db",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from llama_index.readers.file import PyMuPDFReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da320f97-b332-44cb-a0c6-7b5eb7cabaf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = PyMuPDFReader()\n",
    "documents = loader.load(file_path=\"./data/llama2.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8cf70c0-1b67-4855-b2e2-e28fd235f8b7",
   "metadata": {},
   "source": [
    "#### Parse into Nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ee31469-024c-44a0-bd40-bb038e422575",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "\n",
    "node_parser = SentenceSplitter(chunk_size=256)\n",
    "nodes = node_parser.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7641b59e-1ce4-4168-84fa-db98048e940f",
   "metadata": {},
   "source": [
    "#### Generate Embeddings for each Node"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7803b109-98e6-460a-b209-9442660318a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "embed_model = OpenAIEmbedding()\n",
    "for node in nodes:\n",
    "    node_embedding = embed_model.get_text_embedding(\n",
    "        node.get_content(metadata_mode=\"all\")\n",
    "    )\n",
    "    node.embedding = node_embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe942484-a399-4044-9c9f-ea175f9dbafe",
   "metadata": {},
   "source": [
    "## Build a Simple In-Memory Vector Store\n",
    "\n",
    "Now we'll build our in-memory vector store. We'll store Nodes in a plain Python dictionary. We'll start by implementing embedding search, and then add metadata filters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "808f4f0b-0a1f-4f90-b0f1-c08693998189",
   "metadata": {},
   "source": [
    "### 1. Defining the Interface\n",
    "\n",
    "We'll first define the interface for building a vector store. It contains the following items:\n",
    "\n",
    "- `get`\n",
    "- `add`\n",
    "- `delete`\n",
    "- `query`\n",
    "- `persist` (which we will not implement) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71c21211-54c4-4493-bf50-548f99b36f65",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.vector_stores.types import VectorStore\n",
    "from llama_index.core.vector_stores import (\n",
    "    VectorStoreQuery,\n",
    "    VectorStoreQueryResult,\n",
    ")\n",
    "from typing import List, Any, Optional, Dict\n",
    "from llama_index.core.schema import TextNode, BaseNode\n",
    "import os\n",
    "\n",
    "\n",
    "class BaseVectorStore(VectorStore):\n",
    "    \"\"\"Simple custom Vector Store.\n",
    "\n",
    "    Stores documents in a simple in-memory dict.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    stores_text: bool = True\n",
    "\n",
    "    def get(self, text_id: str) -> BaseNode:\n",
    "        \"\"\"Get node.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def add(\n",
    "        self,\n",
    "        nodes: List[BaseNode],\n",
    "    ) -> List[str]:\n",
    "        \"\"\"Add nodes to index.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:\n",
    "        \"\"\"\n",
    "        Delete nodes using ref_doc_id.\n",
    "\n",
    "        Args:\n",
    "            ref_doc_id (str): The doc_id of the document to delete.\n",
    "\n",
    "        \"\"\"\n",
    "        pass\n",
    "\n",
    "    def query(\n",
    "        self,\n",
    "        query: VectorStoreQuery,\n",
    "        **kwargs: Any,\n",
    "    ) -> VectorStoreQueryResult:\n",
    "        \"\"\"Get nodes for response.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def persist(self, persist_path, fs=None) -> None:\n",
    "        \"\"\"Persist the SimpleVectorStore to a directory.\n",
    "\n",
    "        NOTE: we are not implementing this for now.\n",
    "\n",
    "        \"\"\"\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73d291ef-78f7-4cee-a9e3-ee9b6b557354",
   "metadata": {},
   "source": [
    "At a high level, we subclass our base `VectorStore` abstraction. There's no inherent reason to do this if you're just building a vector store from scratch. We do it because it makes it easy to plug into our downstream abstractions later.\n",
    "\n",
    "Let's look at some of the classes defined here.\n",
    "- `BaseNode` is simply the parent class of our core Node modules. Each Node represents a text chunk + associated metadata.\n",
    "- We also use some lower-level constructs, for instance our `VectorStoreQuery` and `VectorStoreQueryResult`. These are just lightweight dataclass containers to represent queries and results. We look at the dataclass fields below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1de927f-fb4c-48d9-a26c-2d9bfc453e36",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query_embedding': typing.Optional[typing.List[float]],\n",
       " 'similarity_top_k': int,\n",
       " 'doc_ids': typing.Optional[typing.List[str]],\n",
       " 'node_ids': typing.Optional[typing.List[str]],\n",
       " 'query_str': typing.Optional[str],\n",
       " 'output_fields': typing.Optional[typing.List[str]],\n",
       " 'embedding_field': typing.Optional[str],\n",
       " 'mode': <enum 'VectorStoreQueryMode'>,\n",
       " 'alpha': typing.Optional[float],\n",
       " 'filters': typing.Optional[llama_index.vector_stores.types.MetadataFilters],\n",
       " 'mmr_threshold': typing.Optional[float],\n",
       " 'sparse_top_k': typing.Optional[int]}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dataclasses import fields\n",
    "\n",
    "{f.name: f.type for f in fields(VectorStoreQuery)}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9fe73ed-ade8-4a67-ba0b-1cb195bb78ab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'nodes': typing.Optional[typing.Sequence[llama_index.schema.BaseNode]],\n",
       " 'similarities': typing.Optional[typing.List[float]],\n",
       " 'ids': typing.Optional[typing.List[str]]}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{f.name: f.type for f in fields(VectorStoreQueryResult)}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b94a1410-3b98-40ad-be3d-1bb79762b723",
   "metadata": {},
   "source": [
    "### 2. Defining `add`, `get`, and `delete`\n",
    "\n",
    "We add some basic capabilities to add, get, and delete from a vector store.\n",
    "\n",
    "The implementation is very simple (everything is just stored in a python dictionary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "590a3112-4bfe-4378-89d9-f10dbd41ef02",
   "metadata": {},
   "outputs": [],
   "source": [
    "class VectorStore2(BaseVectorStore):\n",
    "    \"\"\"VectorStore2 (add/get/delete implemented).\"\"\"\n",
    "\n",
    "    stores_text: bool = True\n",
    "\n",
    "    def __init__(self) -> None:\n",
    "        \"\"\"Init params.\"\"\"\n",
    "        self.node_dict: Dict[str, BaseNode] = {}\n",
    "\n",
    "    def get(self, text_id: str) -> BaseNode:\n",
    "        \"\"\"Get the stored node (including its embedding) by ID.\"\"\"\n",
    "        return self.node_dict[text_id]\n",
    "\n",
    "    def add(\n",
    "        self,\n",
    "        nodes: List[BaseNode],\n",
    "    ) -> List[str]:\n",
    "        \"\"\"Add nodes to index.\"\"\"\n",
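    "        for node in nodes:\n",
    "            self.node_dict[node.node_id] = node\n",
    "        # return the IDs of the inserted nodes, as the signature promises\n",
    "        return [node.node_id for node in nodes]\n",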
    "\n",
    "    def delete(self, node_id: str, **delete_kwargs: Any) -> None:\n",
    "        \"\"\"\n",
    "        Delete the node with the given node_id.\n",
    "\n",
    "        Args:\n",
    "            node_id (str): The ID of the node to delete.\n",
    "\n",
    "        \"\"\"\n",
    "        del self.node_dict[node_id]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c31a321a-b14f-4819-a26a-25b41401905d",
   "metadata": {},
   "source": [
    "We run some basic tests just to show it works well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91a1729f-a51c-48b7-a82c-88081c16df83",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_node = TextNode(id_=\"id1\", text=\"hello world\")\n",
    "test_node2 = TextNode(id_=\"id2\", text=\"foo bar\")\n",
    "test_nodes = [test_node, test_node2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4044322-d104-47ef-b44f-440ca0ef5441",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = VectorStore2()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0650d3b2-0df0-4f6d-be61-6c4cc81d1479",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store.add(test_nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8973354-43c1-4e2d-8209-683501361ea2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Node ID: id1\n",
      "Text: hello world\n"
     ]
    }
   ],
   "source": [
    "node = vector_store.get(\"id1\")\n",
    "print(str(node))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaeff1d2-a609-4b64-bc4d-9c491b189c51",
   "metadata": {},
   "source": [
    "### 3.a Defining `query` (semantic search)\n",
    "\n",
    "We implement a basic version of top-k semantic search. This simply iterates through all document embeddings and computes the cosine similarity of each with the query embedding. The top-k documents by cosine similarity are returned.\n",
    "\n",
    "Cosine similarity: $\\dfrac{\\vec{d} \\cdot \\vec{q}}{|\\vec{d}||\\vec{q}|}$ for every document, query embedding pair $\\vec{d}$, $\\vec{q}$.\n",
    "\n",
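    "As a small worked example (the vectors here are chosen purely for illustration and do not come from the dataset), take $\\vec{d} = (1, 0)$ and $\\vec{q} = (1, 1)$:\n",
    "\n",
    "$$\\dfrac{\\vec{d} \\cdot \\vec{q}}{|\\vec{d}||\\vec{q}|} = \\dfrac{1 \\cdot 1 + 0 \\cdot 1}{1 \\cdot \\sqrt{2}} = \\dfrac{1}{\\sqrt{2}} \\approx 0.707$$\n",
    "\n",
    "A document embedding that points in the same direction as the query scores 1, and one orthogonal to it scores 0.\n",
    "\n",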
    "**NOTE**: The top-k value is contained in the `VectorStoreQuery` container.\n",
    "\n",
    "**NOTE**: Similar to the above, we define another subclass just so we don't have to reimplement the above functions (not because this is actually good code practice)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c3902ca-9463-4920-8d64-36a5ce0a3dcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Tuple\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def get_top_k_embeddings(\n",
    "    query_embedding: List[float],\n",
    "    doc_embeddings: List[List[float]],\n",
    "    doc_ids: List[str],\n",
    "    similarity_top_k: int = 5,\n",
    ") -> Tuple[List[float], List]:\n",
    "    \"\"\"Get top nodes by similarity to the query.\"\"\"\n",
    "    # dimensions: D\n",
    "    qembed_np = np.array(query_embedding)\n",
    "    # dimensions: N x D\n",
    "    dembed_np = np.array(doc_embeddings)\n",
    "    # dimensions: N\n",
    "    dproduct_arr = np.dot(dembed_np, qembed_np)\n",
    "    # dimensions: N\n",
    "    norm_arr = np.linalg.norm(qembed_np) * np.linalg.norm(\n",
    "        dembed_np, axis=1, keepdims=False\n",
    "    )\n",
    "    # dimensions: N\n",
    "    cos_sim_arr = dproduct_arr / norm_arr\n",
    "\n",
    "    # now we have the N cosine similarities for each document\n",
    "    # sort by top k cosine similarity, and return ids\n",
    "    tups = [(cos_sim_arr[i], doc_ids[i]) for i in range(len(doc_ids))]\n",
    "    sorted_tups = sorted(tups, key=lambda t: t[0], reverse=True)\n",
    "\n",
    "    sorted_tups = sorted_tups[:similarity_top_k]\n",
    "\n",
    "    result_similarities = [s for s, _ in sorted_tups]\n",
    "    result_ids = [n for _, n in sorted_tups]\n",
    "    return result_similarities, result_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6e325ec-6108-4747-b0e7-00ee248d5ff7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import cast\n",
    "\n",
    "\n",
    "class VectorStore3A(VectorStore2):\n",
    "    \"\"\"Implements semantic/dense search.\"\"\"\n",
    "\n",
    "    def query(\n",
    "        self,\n",
    "        query: VectorStoreQuery,\n",
    "        **kwargs: Any,\n",
    "    ) -> VectorStoreQueryResult:\n",
    "        \"\"\"Get nodes for response.\"\"\"\n",
    "\n",
    "        query_embedding = cast(List[float], query.query_embedding)\n",
    "        doc_embeddings = [n.embedding for n in self.node_dict.values()]\n",
    "        doc_ids = [n.node_id for n in self.node_dict.values()]\n",
    "\n",
    "        similarities, node_ids = get_top_k_embeddings(\n",
    "            query_embedding,\n",
    "            doc_embeddings,\n",
    "            doc_ids,\n",
    "            similarity_top_k=query.similarity_top_k,\n",
    "        )\n",
    "        result_nodes = [self.node_dict[node_id] for node_id in node_ids]\n",
    "\n",
    "        return VectorStoreQueryResult(\n",
    "            nodes=result_nodes, similarities=similarities, ids=node_ids\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4ca6994-06d1-49b9-b1f9-e4bd9bcbe2f9",
   "metadata": {},
   "source": [
    "### 3.b. Supporting Metadata Filtering\n",
    "\n",
    "The next extension is adding metadata filter support. This means that we first narrow the candidate set down to the documents that pass the metadata filters, and then perform semantic search over those candidates.\n",
    "\n",
    "For simplicity we use metadata filters for exact matching with an AND condition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57637533-efaf-43cd-bff2-cae428d6e507",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.vector_stores import MetadataFilters\n",
    "from llama_index.core.schema import BaseNode\n",
    "from typing import cast\n",
    "\n",
    "\n",
    "def filter_nodes(\n",
    "    nodes: List[BaseNode], filters: MetadataFilters\n",
    ") -> List[BaseNode]:\n",
    "    \"\"\"Keep only nodes whose metadata exactly matches every filter (AND).\"\"\"\n",
    "    filtered_nodes = []\n",
    "    for node in nodes:\n",
    "        matches = True\n",
    "        for f in filters.filters:\n",
    "            # a missing key or a differing value rules the node out\n",
    "            if f.key not in node.metadata or f.value != node.metadata[f.key]:\n",
    "                matches = False\n",
    "                break\n",
    "        if matches:\n",
    "            filtered_nodes.append(node)\n",
    "    return filtered_nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c48feeb4-c6da-42c6-a2c8-aab407ddf652",
   "metadata": {},
   "source": [
    "We add `filter_nodes` as a first-pass over the nodes before running semantic search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77a0418c-fa51-423a-a04d-d5c3990713e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def dense_search(query: VectorStoreQuery, nodes: List[BaseNode]):\n",
    "    \"\"\"Dense search.\"\"\"\n",
    "    query_embedding = cast(List[float], query.query_embedding)\n",
    "    doc_embeddings = [n.embedding for n in nodes]\n",
    "    doc_ids = [n.node_id for n in nodes]\n",
    "    return get_top_k_embeddings(\n",
    "        query_embedding,\n",
    "        doc_embeddings,\n",
    "        doc_ids,\n",
    "        similarity_top_k=query.similarity_top_k,\n",
    "    )\n",
    "\n",
    "\n",
    "class VectorStore3B(VectorStore2):\n",
    "    \"\"\"Implements Metadata Filtering.\"\"\"\n",
    "\n",
    "    def query(\n",
    "        self,\n",
    "        query: VectorStoreQuery,\n",
    "        **kwargs: Any,\n",
    "    ) -> VectorStoreQueryResult:\n",
    "        \"\"\"Get nodes for response.\"\"\"\n",
    "        # 1. First filter by metadata\n",
    "        nodes = list(self.node_dict.values())\n",
    "        if query.filters is not None:\n",
    "            nodes = filter_nodes(nodes, query.filters)\n",
    "        if len(nodes) == 0:\n",
    "            result_nodes = []\n",
    "            similarities = []\n",
    "            node_ids = []\n",
    "        else:\n",
    "            # 2. Then perform semantic search\n",
    "            similarities, node_ids = dense_search(query, nodes)\n",
    "            result_nodes = [self.node_dict[node_id] for node_id in node_ids]\n",
    "        return VectorStoreQueryResult(\n",
    "            nodes=result_nodes, similarities=similarities, ids=node_ids\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c632b32e-abf1-46a6-83bd-2f0d71bfacea",
   "metadata": {},
   "source": [
    "### 4. Load Data into our Vector Store\n",
    "\n",
    "Let's load our text chunks into the vector store, and run two types of queries against it: dense search on its own, and dense search with metadata filters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bffb6ca-824f-4ed0-acda-2d4230cb990f",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = VectorStore3B()\n",
    "# load data into the vector stores\n",
    "vector_store.add(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "454cb5ae-342b-43c1-9e13-27968971077d",
   "metadata": {},
   "source": [
    "Define an example question and embed it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd8f3f42-9f7e-493f-812a-fc1f91464da9",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_str = \"Can you tell me about the key concepts for safety finetuning\"\n",
    "query_embedding = embed_model.get_query_embedding(query_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e8ffd63-6b7d-4722-b05e-1d8045114821",
   "metadata": {},
   "source": [
    "#### Query the vector store with dense search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c46802a8-2de8-45bf-bd45-384e73d6ae57",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "----------------\n",
      "[Node ID 3f74fdf4-0e2e-473e-9b07-10c51eb62794] Similarity: 0.835677131511819\n",
      "\n",
      "total_pages: 77\n",
      "file_path: ./data/llama2.pdf\n",
      "source: 23\n",
      "\n",
      "Specifically, we use the following techniques in safety fine-tuning:\n",
      "1. Supervised Safety Fine-Tuning: We initialize by gathering adversarial prompts and safe demonstra-\n",
      "tions that are then included in the general supervised fine-tuning process (Section 3.1). This teaches\n",
      "the model to align with our safety guidelines even before RLHF, and thus lays the foundation for\n",
      "high-quality human preference data annotation.\n",
      "2. Safety RLHF: Subsequently, we integrate safety in the general RLHF pipeline described in Sec-\n",
      "tion 3.2.2. This includes training a safety-specific reward model and gathering more challenging\n",
      "adversarial prompts for rejection sampling style fine-tuning and PPO optimization.\n",
      "3. Safety Context Distillation: Finally, we refine our RLHF pipeline with context distillation (Askell\n",
      "et al., 2021b).\n",
      "----------------\n",
      "\n",
      "\n",
      "\n",
      "----------------\n",
      "[Node ID 5ad5efb3-8442-4e8a-b35a-cc3a10551dc9] Similarity: 0.827877930608312\n",
      "\n",
      "total_pages: 77\n",
      "file_path: ./data/llama2.pdf\n",
      "source: 23\n",
      "\n",
      "Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general\n",
      "patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have\n",
      "on people or real-world outcomes; that would require study of end-to-end product deployments. Further\n",
      "testing and mitigation should be done to understand bias and other social issues for the specific context\n",
      "in which a system may be deployed. For this, it may be necessary to test beyond the groups available in\n",
      "the BOLD dataset (race, religion, and gender). As LLMs are integrated and deployed, we look forward to\n",
      "continuing research that will amplify their potential for positive impact on these important social issues.\n",
      "4.2\n",
      "Safety Fine-Tuning\n",
      "In this section, we describe our approach to safety fine-tuning, including safety categories, annotation\n",
      "guidelines, and the techniques we use to mitigate safety risks. We employ a process similar to the general\n",
      "fine-tuning methods as described in Section 3, with some notable differences related to safety concerns.\n",
      "----------------\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query_obj = VectorStoreQuery(\n",
    "    query_embedding=query_embedding, similarity_top_k=2\n",
    ")\n",
    "\n",
    "query_result = vector_store.query(query_obj)\n",
    "for similarity, node in zip(query_result.similarities, query_result.nodes):\n",
    "    print(\n",
    "        \"\\n----------------\\n\"\n",
    "        f\"[Node ID {node.node_id}] Similarity: {similarity}\\n\\n\"\n",
    "        f\"{node.get_content(metadata_mode='all')}\"\n",
    "        \"\\n----------------\\n\\n\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbb40f1a-b414-44b8-b16d-7942f57d3975",
   "metadata": {},
   "source": [
    "#### Query the vector store with dense search + Metadata Filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2b02e55-39a0-4cfa-b14a-f35346e77364",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "----------------\n",
      "[Node ID efe54bc0-4f9f-49ad-9dd5-900395a092fa] Similarity: 0.8190195580569283\n",
      "\n",
      "total_pages: 77\n",
      "file_path: ./data/llama2.pdf\n",
      "source: 24\n",
      "\n",
      "4.2.2\n",
      "Safety Supervised Fine-Tuning\n",
      "In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations\n",
      "of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same\n",
      "manner as described in Section 3.1. An example can be found in Table 5.\n",
      "The annotators are instructed to initially come up with prompts that they think could potentially induce\n",
      "the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently,\n",
      "annotators are tasked with crafting a safe and helpful response that the model should produce.\n",
      "4.2.3\n",
      "Safety RLHF\n",
      "We observe early in the development of Llama 2-Chat that it is able to generalize from the safe demonstrations\n",
      "in supervised fine-tuning. The model quickly learns to write detailed safe responses, address safety concerns,\n",
      "explain why the topic might be sensitive, and provide additional helpful information.\n",
      "----------------\n",
      "\n",
      "\n",
      "\n",
      "----------------\n",
      "[Node ID 619c884b-cdbc-44b2-aec0-2692b44740ee] Similarity: 0.8010811332867503\n",
      "\n",
      "total_pages: 77\n",
      "file_path: ./data/llama2.pdf\n",
      "source: 24\n",
      "\n",
      "In particular, when\n",
      "the model outputs safe responses, they are often more detailed than what the average annotator writes.\n",
      "Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to\n",
      "teach the model how to write more nuanced responses. Comprehensive tuning with RLHF has the added\n",
      "benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).\n",
      "We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators\n",
      "write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to\n",
      "the prompts, selecting the response that is safest according to a set of guidelines. We then use the human\n",
      "preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to\n",
      "sample from the model during the RLHF stage.\n",
      "Better Long-Tail Safety Robustness without Hurting Helpfulness\n",
      "Safety is inherently a long-tail problem,\n",
      "where the challenge comes from a small number of very specific cases.\n",
      "----------------\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# filters = MetadataFilters(\n",
    "#     filters=[\n",
    "#         ExactMatchFilter(key=\"page\", value=3)\n",
    "#     ]\n",
    "# )\n",
    "filters = MetadataFilters.from_dict({\"source\": \"24\"})\n",
    "\n",
    "query_obj = VectorStoreQuery(\n",
    "    query_embedding=query_embedding, similarity_top_k=2, filters=filters\n",
    ")\n",
    "\n",
    "query_result = vector_store.query(query_obj)\n",
    "for similarity, node in zip(query_result.similarities, query_result.nodes):\n",
    "    print(\n",
    "        \"\\n----------------\\n\"\n",
    "        f\"[Node ID {node.node_id}] Similarity: {similarity}\\n\\n\"\n",
    "        f\"{node.get_content(metadata_mode='all')}\"\n",
    "        \"\\n----------------\\n\\n\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97df839d-379e-4d19-a760-00998cf0e482",
   "metadata": {},
   "source": [
    "## Build a RAG System with the Vector Store\n",
    "\n",
    "Now that we've built the vector store, it's time to plug it into our downstream RAG system!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdd6b6b3-9942-464d-ba3b-6b5ddf6aedd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3354341-be69-43f0-8496-76fc32dad64e",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex.from_vector_store(vector_store)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1619f80b-e407-42e0-9efe-46dcc80e8624",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13443687-d779-4542-a795-552b42cdf116",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_str = \"Can you tell me about the key concepts for safety finetuning\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c0e7b03-65b4-4b16-923a-0f059388e4de",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(query_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0dd1457-732e-456c-9075-19ea256a63fb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines before RLHF. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering more challenging adversarial prompts for fine-tuning and optimization. Finally, safety context distillation is used to refine the RLHF pipeline. These techniques aim to mitigate safety risks and ensure that the model aligns with safety guidelines.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe8647e5-2338-4a50-835e-a31e32118a5a",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "That's it! We've built a simple in-memory vector store that supports basic inserts, gets, and deletes, as well as dense search and metadata filtering. It can then be plugged into the rest of the LlamaIndex abstractions.\n",
    "\n",
    "It doesn't support sparse search yet and is not meant to be used in a real application. But it should expose some of what's going on under the hood!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
